Modify your personal data (name and ID) in the header of the .qmd document. Do not touch anything else in the header (note that I have included embed-resources: true so that everything is contained in a single html file without extra files, and theme: [style.scss] to give the delivery a nice look using the style.scss file in the folder).
Make sure, BEFORE editing the document further, that the .qmd file renders correctly and the corresponding .html is generated in your local folder. The chunks (code boxes) provided are either empty or incomplete, hence most of them have the #| eval: false option. Once you have edited them as you see fit, you must change each chunk to #| eval: true (or remove the option entirely) so that they run.
Remember that you can run chunk by chunk with the play button, or run all chunks up to a given chunk (with the button to the left of the previous one).
Add in the chunk below all the packages you need.
In the project folder you have the dataset messy_covid_data.xlsx. Take a look at the {readxl} package and load the file correctly.
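The loading chunk itself is not shown; a minimal sketch, assuming the file sits in the project folder and that {readxl} is installed (read_xlsx() guesses the column types, which yields the <chr>, <dttm> and <dbl> columns shown below):

```r
# Load the messy Excel file; read_xlsx() comes from {readxl}
library(readxl)

data <- read_xlsx("messy_covid_data.xlsx")
data  # quick look at the imported tibble
```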
# A tibble: 43,301 × 32
provincia_iso fecha `0-9_H` `10-19_H` `20-29_H` `30-39_H`
<chr> <dttm> <dbl> <dbl> <dbl> <dbl>
1 A 2020-01-01 00:00:00 0 0 0 0
2 AB 2020-01-01 00:00:00 0 0 0 0
3 AL 2020-01-01 00:00:00 0 0 0 0
4 AV 2020-01-01 00:00:00 0 0 0 0
5 B 2020-01-01 00:00:00 0 0 0 0
6 BA 2020-01-01 00:00:00 0 0 0 0
7 BI 2020-01-01 00:00:00 0 0 0 0
8 BU 2020-01-01 00:00:00 0 0 0 0
9 C 2020-01-01 00:00:00 0 0 0 0
10 CA 2020-01-01 00:00:00 0 0 0 0
# ℹ 43,291 more rows
# ℹ 26 more variables: `40-49_H` <dbl>, `50-59_H` <dbl>, `60-69_H` <dbl>,
# `70-79_H` <dbl>, `80+_H` <dbl>, NC_H <dbl>, `0-9_M` <dbl>, `10-19_M` <dbl>,
# `20-29_M` <dbl>, `30-39_M` <dbl>, `40-49_M` <dbl>, `50-59_M` <dbl>,
# `60-69_M` <dbl>, `70-79_M` <dbl>, `80+_M` <dbl>, NC_M <dbl>,
# `0-9_NC` <dbl>, `10-19_NC` <dbl>, `20-29_NC` <dbl>, `30-39_NC` <dbl>,
# `40-49_NC` <dbl>, `50-59_NC` <dbl>, `60-69_NC` <dbl>, `70-79_NC` <dbl>, …
Column names code sex (H male, M female, NC as missing) and age group (0-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, ≥80 years and NC as missing).
Design a loop that goes through all rows and, except for the first two columns, converts each 0 found in the row to an NA.
My computer doesn’t have the capacity to render this loop (it keeps rendering for hours), but it would look like this (shown here without the #):
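A sketch of what that commented-out loop could look like (assuming the imported tibble is called data, as in the later chunks):

```r
# Row-by-row loop (very slow on the full 43,301-row tibble):
for (i in 1:nrow(data)) {
  for (j in 3:ncol(data)) {          # skip provincia_iso and fecha
    if (!is.na(data[[j]][i]) && data[[j]][i] == 0) {
      data[[j]][i] <- NA             # turn each 0 into a missing value
    }
  }
}
```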
Perform the above action with the original table but in tidyverse mode: loops, brackets and dollars are forbidden. Make it as generally applicable as possible.
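One possible formulation (a sketch: across() applies na_if() to every column except the first two, so it generalizes to any number of age/sex columns):

```r
library(dplyr)

data <- data |>
  mutate(across(-(1:2), \(x) na_if(x, 0)))  # 0 -> NA in all but the first two columns
```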
# A tibble: 43,301 × 32
provincia_iso fecha `0-9_H` `10-19_H` `20-29_H` `30-39_H`
<chr> <dttm> <dbl> <dbl> <dbl> <dbl>
1 A 2020-01-01 00:00:00 NA NA NA NA
2 AB 2020-01-01 00:00:00 NA NA NA NA
3 AL 2020-01-01 00:00:00 NA NA NA NA
4 AV 2020-01-01 00:00:00 NA NA NA NA
5 B 2020-01-01 00:00:00 NA NA NA NA
6 BA 2020-01-01 00:00:00 NA NA NA NA
7 BI 2020-01-01 00:00:00 NA NA NA NA
8 BU 2020-01-01 00:00:00 NA NA NA NA
9 C 2020-01-01 00:00:00 NA NA NA NA
10 CA 2020-01-01 00:00:00 NA NA NA NA
# ℹ 43,291 more rows
# ℹ 26 more variables: `40-49_H` <dbl>, `50-59_H` <dbl>, `60-69_H` <dbl>,
# `70-79_H` <dbl>, `80+_H` <dbl>, NC_H <dbl>, `0-9_M` <dbl>, `10-19_M` <dbl>,
# `20-29_M` <dbl>, `30-39_M` <dbl>, `40-49_M` <dbl>, `50-59_M` <dbl>,
# `60-69_M` <dbl>, `70-79_M` <dbl>, `80+_M` <dbl>, NC_M <dbl>,
# `0-9_NC` <dbl>, `10-19_NC` <dbl>, `20-29_NC` <dbl>, `30-39_NC` <dbl>,
# `40-49_NC` <dbl>, `50-59_NC` <dbl>, `60-69_NC` <dbl>, `70-79_NC` <dbl>, …
Design a function to test both methods using the {microbenchmark} package in terms of time efficiency.
I create a random sample of 1000 rows to test the functions on, to make the comparison easier and faster.
set.seed(1234)
sample_data <- data |> slice_sample(n = 1000)
#Then I compare methods
method_loop <- function(data) {
for (i in 1:nrow(data)) {
data[i, -(1:2)] <- lapply(data[i, -(1:2)], function(x) ifelse(x == 0, NA, x))
}
return(data)
}
method_tidyverse <- function(data) {
data |>
mutate(across(-c(1, 2), ~ ifelse(. == 0, NA, .)))
}
benchmark_results <- microbenchmark(
loop_method = method_loop(sample_data),
tidyverse_method = method_tidyverse(sample_data),
times = 30 #number of times each method should be run to measure its performance (it could be more than 30 but 30 is reasonable and faster)
)
Unit: milliseconds
expr min lq mean median uq max neval
loop_method 468.5942 476.9987 517.716480 482.7577 535.5714 764.3472 30
tidyverse_method 5.4045 5.7116 6.101987 5.8884 6.1217 10.9894 30
ANSWER: When comparing the two methods on a random sample of 1000 rows, the loop method takes much longer than the tidyverse method: roughly 500 milliseconds on average for the loop versus roughly 6 milliseconds for the tidyverse version (the exact values vary a little each time the document is rendered).
Give the reasons why the data is not tidy, and convert it to tidy data, deleting rows as appropriate.
tidy_covid <- data |>
pivot_longer(
cols = -c(provincia_iso, fecha),
names_to = "age_sexo",
values_to = "cases"
) |>
separate(
age_sexo,
into = c("age_group", "sex"),
sep = "_",
remove = FALSE
) |>
select(-age_sexo)
print(tidy_covid, width = Inf) # to show all columns
# A tibble: 1,299,030 × 5
provincia_iso fecha age_group sex cases
<chr> <dttm> <chr> <chr> <dbl>
1 A 2020-01-01 00:00:00 0-9 H NA
2 A 2020-01-01 00:00:00 10-19 H NA
3 A 2020-01-01 00:00:00 20-29 H NA
4 A 2020-01-01 00:00:00 30-39 H NA
5 A 2020-01-01 00:00:00 40-49 H NA
6 A 2020-01-01 00:00:00 50-59 H NA
7 A 2020-01-01 00:00:00 60-69 H NA
8 A 2020-01-01 00:00:00 70-79 H NA
9 A 2020-01-01 00:00:00 80+ H NA
10 A 2020-01-01 00:00:00 NC H NA
# ℹ 1,299,020 more rows
ANSWER: The original data is not tidy for several reasons. First, each variable is not a single column (e.g., age is not one variable; instead we have 0-9_H, 10-19_H, etc.). Sex is also mixed with age in the column names, so there is no single column for “age” and another for “sex”. Finally, each observation is not a single row: each combination of province, date, age group and sex should be a separate row; otherwise we cannot express statements like “for province X, age Y and sex Z there is no value”.
One of the columns is coded sometimes as thing_thing, sometimes as thing+_thing, and sometimes as NC_thing. Try to separate that column to generate three new columns age_inf, age_up and sex properly. For example, 10-19_H should send 10 to one column, 19 to another and H to another; 80+_H should send 80 to one column, NA to another (there is no upper bound) and H to another; and NC_H should give NA, NA and H.
tidy_covid <- data |>
pivot_longer(
cols = -c(provincia_iso, fecha), # Keep these columns
names_to = "age_sex",
values_to = "cases"
) |>
mutate(
age_inf = str_extract(age_sex, "^[0-9]+"), # we create age_inf
age_up = str_extract(age_sex, "(?<=-)[0-9]+"), #we create age_up
sex = str_extract(age_sex, "(?<=_)[A-Za-z]+"), #we create sex
age_inf = as.numeric(age_inf),
age_up = as.numeric(age_up)
) |>
#special cases like "80+" and "NC"
mutate(
age_up = ifelse(str_detect(age_sex, "\\+"), NA, age_up), #NA if the format includes "+"
age_inf = ifelse(str_detect(age_sex, "^NC"), NA, age_inf), #NA if it's "NC"
age_up = ifelse(str_detect(age_sex, "^NC"), NA, age_up), #NA if it's "NC"
sex = ifelse(sex == "NC", NA, sex) #NA if it's "NC"
)
print(tidy_covid, width = Inf)
# A tibble: 1,299,030 × 7
provincia_iso fecha age_sex cases age_inf age_up sex
<chr> <dttm> <chr> <dbl> <dbl> <dbl> <chr>
1 A 2020-01-01 00:00:00 0-9_H NA 0 9 H
2 A 2020-01-01 00:00:00 10-19_H NA 10 19 H
3 A 2020-01-01 00:00:00 20-29_H NA 20 29 H
4 A 2020-01-01 00:00:00 30-39_H NA 30 39 H
5 A 2020-01-01 00:00:00 40-49_H NA 40 49 H
6 A 2020-01-01 00:00:00 50-59_H NA 50 59 H
7 A 2020-01-01 00:00:00 60-69_H NA 60 69 H
8 A 2020-01-01 00:00:00 70-79_H NA 70 79 H
9 A 2020-01-01 00:00:00 80+_H NA 80 NA H
10 A 2020-01-01 00:00:00 NC_H NA NA NA H
# ℹ 1,299,020 more rows
ANSWER: With this code we first create a column age_sex and another column cases, gathering all the data so as not to have lots of unnecessary columns. Secondly, we split the age_sex column we just created into three columns (age_inf, age_up and sex), making sure the ages remain numeric. Finally, special cases such as “80+” or “NC” are set to missing (NA). I also keep the age_sex variable because we will need it in a later exercise to extract the age group.
Add a new variable month_year to the table that encodes the month and year (for example, any day in January 2020 will be something like “1-2020” and any day in February 2021 will be “2-2021”).
tidy_covid <- tidy_covid |>
mutate(
month_year = paste0(month(fecha), "-", year(fecha)))
print(tidy_covid, width = Inf)
# A tibble: 1,299,030 × 8
provincia_iso fecha age_sex cases age_inf age_up sex
<chr> <dttm> <chr> <dbl> <dbl> <dbl> <chr>
1 A 2020-01-01 00:00:00 0-9_H NA 0 9 H
2 A 2020-01-01 00:00:00 10-19_H NA 10 19 H
3 A 2020-01-01 00:00:00 20-29_H NA 20 29 H
4 A 2020-01-01 00:00:00 30-39_H NA 30 39 H
5 A 2020-01-01 00:00:00 40-49_H NA 40 49 H
6 A 2020-01-01 00:00:00 50-59_H NA 50 59 H
7 A 2020-01-01 00:00:00 60-69_H NA 60 69 H
8 A 2020-01-01 00:00:00 70-79_H NA 70 79 H
9 A 2020-01-01 00:00:00 80+_H NA 80 NA H
10 A 2020-01-01 00:00:00 NC_H NA NA NA H
month_year
<chr>
1 1-2020
2 1-2020
3 1-2020
4 1-2020
5 1-2020
6 1-2020
7 1-2020
8 1-2020
9 1-2020
10 1-2020
# ℹ 1,299,020 more rows
Import from wikipedia (using code) https://es.wikipedia.org/wiki/ISO_3166-2:ES#Provincias the table containing the ISO codes for each province.
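The scraping chunk is not shown; a sketch assuming {rvest} is available (html_element("table.wikitable") grabs the first wikitable on the page, a selector that may need adjusting if the page layout changes):

```r
library(rvest)

url <- "https://es.wikipedia.org/wiki/ISO_3166-2:ES"
prop_ISO <- read_html(url) |>
  html_element("table.wikitable") |>  # first wikitable: the province ISO codes
  html_table()
```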
I had problems with the variable “Nombre de la subdivisión en la ISO[1]” in later exercises, so I clean and rename the column names:
colnames(prop_ISO) <- gsub(" ", "_", colnames(prop_ISO))
colnames(prop_ISO) <- gsub("[[:punct:]]", "", colnames(prop_ISO))
prop_ISO <- prop_ISO |>
rename(provincia = NombredelasubdivisiónenlaISO1,
CCAA = Comunidadautónoma)
print(prop_ISO)
# A tibble: 50 × 3
Código provincia CCAA
<chr> <chr> <chr>
1 ES-C A Coruña (La Coruña)[nota 1] GA
2 ES-VI[nota 3] Araba/Álava[nota 4] PV
3 ES-AB Albacete CM
4 ES-A Alacant/Alicante[nota 2] VC
5 ES-AL Almería AN
6 ES-O[nota 3] Asturias AS
7 ES-AV Ávila CL
8 ES-BA Badajoz EX
9 ES-PM[nota 3] Balears (Baleares)[nota 1] IB
10 ES-B Barcelona (Barcelona)[nota 1] CT
# ℹ 40 more rows
Some provinces and codes have “[nota x]” appended, so we clean them to allow a clean join in the next exercise.
prop_ISO <- prop_ISO |>
mutate(
provincia = gsub("\\[nota \\d+\\]?", "", provincia),
provincia = gsub("\\s?\\(.*?\\)", "", provincia),
provincia = trimws(provincia),
Código = gsub("\\[nota \\d+\\]?", "", Código),
Código = gsub("\\s?\\(.*?\\)", "", Código),
Código = trimws(Código),
Código = gsub("ES-", "", Código))
print(prop_ISO)
# A tibble: 50 × 3
Código provincia CCAA
<chr> <chr> <chr>
1 C A Coruña GA
2 VI Araba/Álava PV
3 AB Albacete CM
4 A Alacant/Alicante VC
5 AL Almería AN
6 O Asturias AS
7 AV Ávila CL
8 BA Badajoz EX
9 PM Balears IB
10 B Barcelona CT
# ℹ 40 more rows
Preprocess the previous table as you consider in order to join it to the tidy_covid table.
Before joining the databases we have to check if the values of Código/provincia_iso match
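That check can be sketched with anti_join(), which lists the codes in tidy_covid that have no match in prop_ISO (it should surface CE, ML and NC, plus the missing value that the Navarra exercise later fixes):

```r
# Codes present in tidy_covid but absent from prop_ISO
tidy_covid |>
  distinct(provincia_iso) |>
  anti_join(prop_ISO, by = c("provincia_iso" = "Código"))
```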
We do a left join including Ceuta, Melilla and NC
# Join (we add CE, ML and NC):
final_data <- tidy_covid |>
left_join(prop_ISO, by = c("provincia_iso" = "Código")) |>
mutate(
provincia = case_when(
provincia_iso == "CE" ~ "Ceuta",
provincia_iso == "ML" ~ "Melilla",
provincia_iso == "NC" ~ "NC",
TRUE ~ provincia
),
CCAA = case_when(
provincia_iso == "CE" ~ "CE",
provincia_iso == "ML" ~ "ML",
provincia_iso == "NC" ~ "NC",
TRUE ~ CCAA
  ))
Before joining the tables, I checked that the province codes coincide, and where they didn’t I added the corresponding values (specifically Ceuta, Melilla and NC - No contesta). [I could also have done it with a full_join, but that produced an extra observation at the end about Navarra that would have to be removed. I didn’t use an inner_join because I would have lost the data for Ceuta, Melilla and NC.] In exercise 11 we will be asked to add the values for Navarra to fully complete the database.
# A tibble: 1,299,030 × 10
provincia_iso fecha age_sex cases age_inf age_up sex
<chr> <dttm> <chr> <dbl> <dbl> <dbl> <chr>
1 A 2020-01-01 00:00:00 0-9_H NA 0 9 H
2 A 2020-01-01 00:00:00 10-19_H NA 10 19 H
3 A 2020-01-01 00:00:00 20-29_H NA 20 29 H
4 A 2020-01-01 00:00:00 30-39_H NA 30 39 H
5 A 2020-01-01 00:00:00 40-49_H NA 40 49 H
6 A 2020-01-01 00:00:00 50-59_H NA 50 59 H
7 A 2020-01-01 00:00:00 60-69_H NA 60 69 H
8 A 2020-01-01 00:00:00 70-79_H NA 70 79 H
9 A 2020-01-01 00:00:00 80+_H NA 80 NA H
10 A 2020-01-01 00:00:00 NC_H NA NA NA H
month_year provincia CCAA
<chr> <chr> <chr>
1 1-2020 Alacant/Alicante VC
2 1-2020 Alacant/Alicante VC
3 1-2020 Alacant/Alicante VC
4 1-2020 Alacant/Alicante VC
5 1-2020 Alacant/Alicante VC
6 1-2020 Alacant/Alicante VC
7 1-2020 Alacant/Alicante VC
8 1-2020 Alacant/Alicante VC
9 1-2020 Alacant/Alicante VC
10 1-2020 Alacant/Alicante VC
# ℹ 1,299,020 more rows
Using the previous grouping variable month_year, obtain a summary that returns, in a tibble, the daily average of cases for each province and each month-year (regardless of age and sex).
summary_cases <- final_data |>
group_by(provincia_iso, month_year) |>
summarise(
total_cases = sum(cases, na.rm = TRUE),
num_days = n_distinct(fecha),
daily_cases = total_cases / num_days,
.groups = "drop" #Avoid the groups from being there after summarizing
)
print(summary_cases, width = Inf)
# A tibble: 1,431 × 5
provincia_iso month_year total_cases num_days daily_cases
<chr> <chr> <dbl> <int> <dbl>
1 A 1-2020 0 31 0
2 A 1-2021 74835 31 2414.
3 A 1-2022 172951 31 5579.
4 A 10-2020 10364 31 334.
5 A 10-2021 2081 31 67.1
6 A 11-2020 14139 30 471.
7 A 11-2021 6108 30 204.
8 A 12-2020 18136 31 585.
9 A 12-2021 49708 31 1603.
10 A 2-2020 0 29 0
# ℹ 1,421 more rows
ANSWER: First we group by province and month-year so that the calculations happen within those groups. Once grouped, we compute the daily average of cases (total cases / number of days). We use .groups = "drop" so that the returned data frame is ungrouped again.
Design a function summ_by_dates_prov() that, given a vector of province ISO codes and a table with a province id, a date (we don’t care about the month-year because we are going to create it inside the function) and a cases column, returns the average daily cases in each month-year (regardless of sex or age) for the provided provinces. Then apply the function to the ISO codes of the prov_allowed vector below and check that it gives you the same as before (it should…).
prov_allowed <- c("M", "B", "V", "SE", "Z", "MA")
summ_by_dates_prov <- function(prov_codes, data_table) {
data_table |>
filter(provincia_iso %in% prov_codes) |> #Filter provinces that coincide with prov_codes
mutate(month_year = format(fecha, "%m-%Y")) |>
group_by(provincia_iso, month_year) |> #Calculate the total cases and the number of unique days per group
summarise(
total_cases = sum(cases, na.rm = TRUE),
num_days = n_distinct(fecha),
daily_avg_cases = total_cases / num_days,
.groups = "drop"
)
}
result <- summ_by_dates_prov(prov_allowed, final_data)
print(result, width = Inf)
# A tibble: 162 × 5
provincia_iso month_year total_cases num_days daily_avg_cases
<chr> <chr> <dbl> <int> <dbl>
1 B 01-2020 1 31 0.0323
2 B 01-2021 74662 31 2408.
3 B 01-2022 633958 31 20450.
4 B 02-2020 6 29 0.207
5 B 02-2021 35824 28 1279.
6 B 02-2022 127595 28 4557.
7 B 03-2020 19365 31 625.
8 B 03-2021 29919 31 965.
9 B 03-2022 55939 27 2072.
10 B 04-2020 19910 30 664.
# ℹ 152 more rows
ANSWER: We get the same results as before for these provinces (prov_allowed): checking the rows of the previous exercise with provincia_iso equal to “M”, “B”, “V”, “SE”, “Z” and “MA” gives the same values. The function filters the data to a specific set of provinces, calculates the daily average of cases for each month-year and province, and returns a summary with these averages. We then apply it to our database and the provinces we want.
Run the code you consider appropriate to properly recode the province ISO codes (right now Navarra appears as NA; look into what should be there instead and fix it).
final_data <- final_data |>
  mutate(
    # Navarra's ISO 3166-2 code is literally "NA", which was read in as missing
    provincia_iso = ifelse(is.na(provincia_iso), "NA", provincia_iso),
    provincia = ifelse(provincia_iso == "NA", "Navarra", provincia),
    CCAA = ifelse(provincia_iso == "NA", "NA", CCAA)
  )
print(final_data, width = Inf)
# A tibble: 1,299,030 × 10
provincia_iso fecha age_sex cases age_inf age_up sex
<chr> <dttm> <chr> <dbl> <dbl> <dbl> <chr>
1 A 2020-01-01 00:00:00 0-9_H NA 0 9 H
2 A 2020-01-01 00:00:00 10-19_H NA 10 19 H
3 A 2020-01-01 00:00:00 20-29_H NA 20 29 H
4 A 2020-01-01 00:00:00 30-39_H NA 30 39 H
5 A 2020-01-01 00:00:00 40-49_H NA 40 49 H
6 A 2020-01-01 00:00:00 50-59_H NA 50 59 H
7 A 2020-01-01 00:00:00 60-69_H NA 60 69 H
8 A 2020-01-01 00:00:00 70-79_H NA 70 79 H
9 A 2020-01-01 00:00:00 80+_H NA 80 NA H
10 A 2020-01-01 00:00:00 NC_H NA NA NA H
month_year provincia CCAA
<chr> <chr> <chr>
1 1-2020 Alacant/Alicante VC
2 1-2020 Alacant/Alicante VC
3 1-2020 Alacant/Alicante VC
4 1-2020 Alacant/Alicante VC
5 1-2020 Alacant/Alicante VC
6 1-2020 Alacant/Alicante VC
7 1-2020 Alacant/Alicante VC
8 1-2020 Alacant/Alicante VC
9 1-2020 Alacant/Alicante VC
10 1-2020 Alacant/Alicante VC
# ℹ 1,299,020 more rows
ANSWER: Besides recoding as the string "NA" the codes that were missing, we also added the corresponding province and CCAA values for Navarra. Now the database includes all the provinces.
With the database generated in the previous exercise, calculate the proportion of cases with unknown sex. Do the same with unknown province. Eliminate such records if the number of cases represents less than 1% (for each).
# Unknown sex and provinces
final_data |>
  mutate(
    unknown_sex = is.na(sex),
    unknown_province = is.na(provincia_iso) | provincia_iso == "NC"
  ) |>
summarise(
total_cases = sum(cases, na.rm = TRUE),
unknown_sex_cases = sum(cases[unknown_sex], na.rm = TRUE),
unknown_province_cases = sum(cases[unknown_province], na.rm = TRUE),
prop_unknown_sex = unknown_sex_cases / total_cases,
prop_unknown_province = unknown_province_cases / total_cases
) |>
  print()
# A tibble: 1 × 5
total_cases unknown_sex_cases unknown_province_cases prop_unknown_sex
<dbl> <dbl> <dbl> <dbl>
1 11585186 9033 39952 0.000780
# ℹ 1 more variable: prop_unknown_province <dbl>
ANSWER: About 0.08% of all cases (proportion ≈ 0.00078) have unknown sex, and about 0.34% (proportion ≈ 0.00345) have an unknown province.
Since in both cases (sex and province) the number of cases represents less than 1% of the database, we simply eliminate those records.
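The elimination chunk is not shown; a sketch of what it could look like (drops the records with unknown sex and those with unknown province; after the Navarra recode, the unknown provinces are the "NC" rows):

```r
final_data <- final_data |>
  filter(!is.na(sex),             # drop unknown sex (< 1% of cases)
         provincia_iso != "NC")   # drop unknown province (< 1% of cases)
```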
Create a new variable called cum_cases containing the accumulated cases up to each date, disaggregated by province, age group and sex.
# we create again the variable "age_group"
final_data <- final_data |>
separate(
age_sex,
into = c("age_group", "sexo"),
sep = "_",
remove = FALSE) |>
select (-sexo, -age_sex)
final_data <- final_data |>
  group_by(provincia_iso, age_group, sex) |>
  arrange(fecha, .by_group = TRUE) |> # order by date so the running sum is chronological
  mutate(
    cum_cases = cumsum(replace_na(cases, 0)) # replace NA with 0 only within the cumulative sum
  ) |>
  ungroup()
print(final_data, width = Inf)
# A tibble: 866,020 × 11
provincia_iso fecha age_group cases age_inf age_up sex
<chr> <dttm> <chr> <dbl> <dbl> <dbl> <chr>
1 A 2020-01-01 00:00:00 0-9 NA 0 9 H
2 A 2020-01-01 00:00:00 10-19 NA 10 19 H
3 A 2020-01-01 00:00:00 20-29 NA 20 29 H
4 A 2020-01-01 00:00:00 30-39 NA 30 39 H
5 A 2020-01-01 00:00:00 40-49 NA 40 49 H
6 A 2020-01-01 00:00:00 50-59 NA 50 59 H
7 A 2020-01-01 00:00:00 60-69 NA 60 69 H
8 A 2020-01-01 00:00:00 70-79 NA 70 79 H
9 A 2020-01-01 00:00:00 80+ NA 80 NA H
10 A 2020-01-01 00:00:00 NC NA NA NA H
month_year provincia CCAA cum_cases
<chr> <chr> <chr> <dbl>
1 1-2020 Alacant/Alicante VC 0
2 1-2020 Alacant/Alicante VC 0
3 1-2020 Alacant/Alicante VC 0
4 1-2020 Alacant/Alicante VC 0
5 1-2020 Alacant/Alicante VC 0
6 1-2020 Alacant/Alicante VC 0
7 1-2020 Alacant/Alicante VC 0
8 1-2020 Alacant/Alicante VC 0
9 1-2020 Alacant/Alicante VC 0
10 1-2020 Alacant/Alicante VC 0
# ℹ 866,010 more rows
ANSWER: The code groups the records by province, age group and sex (after first recreating the age_group variable) and orders each group by date. It then calculates the cumulative sum of cases (cum_cases) within each group, substituting 0 for the NAs in cases so the sum has no problems. Finally, it ungroups the data so it doesn’t remain grouped after the operation.
What were the 7 provinces with the most cases throughout the pandemic? And the 5 provinces with the fewest deaths? And if we disaggregate by sex? (We don’t know the number of deaths, so we only do the part of the exercise that we can.)
province_cases <- final_data |>
group_by(provincia_iso) |>
summarise(total_cases = sum(cases, na.rm = TRUE)) |>
arrange(desc(total_cases))
province_cases |>
  head(7)
# A tibble: 7 × 2
provincia_iso total_cases
<chr> <dbl>
1 B 1756033
2 M 1640978
3 V 728118
4 A 467584
5 MU 395267
6 BI 335126
7 Z 296091
ANSWER: The 7 provinces with the most cases are Barcelona, Madrid, Valencia, Alicante, Murcia, Bizkaia and Zaragoza. This makes sense because they are provinces with large populations, so they will also have a large number of infected people.
And if we disaggregate by sex:
province_sex_cases <- final_data |>
group_by(provincia_iso, sex) |>
summarise(total_cases = sum(cases, na.rm = TRUE)) |>
arrange(desc(total_cases))
province_sex_cases |>
  head(7)
# A tibble: 7 × 3
# Groups: provincia_iso [4]
provincia_iso sex total_cases
<chr> <chr> <dbl>
1 B M 932573
2 M M 876472
3 B H 823460
4 M H 764506
5 V M 383696
6 V H 344422
7 A M 247004
ANSWER: The top 7 province-sex combinations with the most cases again correspond to Barcelona, Madrid, Valencia and Alicante, the same provinces as in the previous exercise.
Use the {datapasta} package to import the population table of the provinces by copying from https://www.ine.es/jaxiT3/Datos.htm?t=2852. Incorporate this info into the table as you see fit.
In the RStudio “Addins” menu, the datapasta package offers “Paste as tribble”, which generated this code:
library(datapasta)
population <- tibble::tribble(
~provincia, ~poblacion,
"Total", 47385107,
"Albacete", 386464,
"Alacant/Alicante", 1881762,
"Almería", 731792,
"Araba/Álava", 333626,
"Asturias", 1011792,
"Ávila", 158421,
"Badajoz", 669943,
"Balears Illes", 1173008,
"Barcelona", 5714730,
"Bizkaia", 1154334,
"Burgos", 356055,
"Cáceres", 389558,
"Cádiz", 1245960,
"Cantabria", 584507,
"Castellón/Castelló", 587064,
"Ciudad Real", 492591,
"Córdoba", 776789,
"Coruña A", 1120134,
"Cuenca", 195516,
"Gipuzkoa", 726033,
"Girona", 786596,
"Granada", 921338,
"Guadalajara", 265588,
"Huelva", 525835,
"Huesca", 224264,
"Jaén", 627190,
"León", 451706,
"Lleida", 439727,
"Lugo", 326013,
"Madrid", 6751251,
"Málaga", 1695651,
"Murcia", 1518486,
"Navarra", 661537,
"Ourense", 305223,
"Palencia", 159123,
"Palmas Las", 1128539,
"Pontevedra", 944275,
"Rioja La", 319796,
"Salamanca", 327338,
"Santa Cruz de Tenerife", 1044405,
"Segovia", 153663,
"Sevilla", 1947852,
"Soria", 88747,
"Tarragona", 822309,
"Teruel", 134545,
"Toledo", 709403,
"Valencia/València", 2589312,
"Valladolid", 519361,
"Zamora", 168725,
"Zaragoza", 967452,
"Ceuta", 83517,
"Melilla", 86261
)
I was having trouble using the {datapasta} functions directly, so I generated the code from the Addins menu, which produced the tibble::tribble() call above. Then I join it with the final_data database.
Before joining the databases we have to check if the values match
final_data_vals <- unique(final_data$provincia)
population_vals <- unique(population$provincia)
# Values of final_data that aren't in population: A Coruña, Las Palmas, NC, València/Valencia, Castelló/Castellón, La Rioja, Balears
final_data_vals[!final_data_vals %in% population_vals]
[1] "A Coruña"           "Castelló/Castellón" "Las Palmas"
[4] "La Rioja"           "NC"                 "Balears"
[7] "València/Valencia"
# Values of population that aren't in final_data: Total, Castellón/Castelló, Palmas Las, Valencia/València, Balears Illes, Coruña A, Rioja La
population_vals[!population_vals %in% final_data_vals]
[1] "Total"              "Balears Illes"      "Castellón/Castelló"
[4] "Coruña A"           "Palmas Las"         "Rioja La"
[7] "Valencia/València"
We join both databases taking into account the non matching values
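The definition of the equivalences vector used in the next chunk is missing from the document; reconstructed from the two mismatch lists above, it could look like this ("NC" and "Total" have no counterpart, so NC simply gets an NA population after the join):

```r
# Names as they appear in final_data -> names as they appear in population
equivalences <- c(
  "A Coruña"           = "Coruña A",
  "Castelló/Castellón" = "Castellón/Castelló",
  "Las Palmas"         = "Palmas Las",
  "La Rioja"           = "Rioja La",
  "Balears"            = "Balears Illes",
  "València/Valencia"  = "Valencia/València"
)
```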
final_data <- final_data |>
mutate(provincia = ifelse(provincia %in% names(equivalences),
equivalences[provincia],
provincia))
final_data <- final_data |>
left_join(population, by = "provincia")
print(final_data, width = Inf)
# A tibble: 866,020 × 12
provincia_iso fecha age_group cases age_inf age_up sex
<chr> <dttm> <chr> <dbl> <dbl> <dbl> <chr>
1 A 2020-01-01 00:00:00 0-9 NA 0 9 H
2 A 2020-01-01 00:00:00 10-19 NA 10 19 H
3 A 2020-01-01 00:00:00 20-29 NA 20 29 H
4 A 2020-01-01 00:00:00 30-39 NA 30 39 H
5 A 2020-01-01 00:00:00 40-49 NA 40 49 H
6 A 2020-01-01 00:00:00 50-59 NA 50 59 H
7 A 2020-01-01 00:00:00 60-69 NA 60 69 H
8 A 2020-01-01 00:00:00 70-79 NA 70 79 H
9 A 2020-01-01 00:00:00 80+ NA 80 NA H
10 A 2020-01-01 00:00:00 NC NA NA NA H
month_year provincia CCAA cum_cases poblacion
<chr> <chr> <chr> <dbl> <dbl>
1 1-2020 Alacant/Alicante VC 0 1881762
2 1-2020 Alacant/Alicante VC 0 1881762
3 1-2020 Alacant/Alicante VC 0 1881762
4 1-2020 Alacant/Alicante VC 0 1881762
5 1-2020 Alacant/Alicante VC 0 1881762
6 1-2020 Alacant/Alicante VC 0 1881762
7 1-2020 Alacant/Alicante VC 0 1881762
8 1-2020 Alacant/Alicante VC 0 1881762
9 1-2020 Alacant/Alicante VC 0 1881762
10 1-2020 Alacant/Alicante VC 0 1881762
# ℹ 866,010 more rows
ANSWER: I joined both databases, taking into account that some province names do not coincide exactly, such as “Las Palmas” vs “Palmas Las” or “València/Valencia” vs “Valencia/València”. For these I created a vector of equivalences and recoded one of the databases, so that the join produces no missing values.
Define a function called cum_incidence() that, given as arguments a vector of cases ordered by date and a parameter \(k\), calculates the cumulative incidence at \(k\) days (understood as the number of cases in the last \(k\) days per 100,000 inhabitants). Use this function to create a new variable representing the cumulative incidence in each age group, sex and province. Then determine the 5 provinces with the highest cumulative incidence in women over 80 years of age as of March 1, 2022.
# Cumulative incidence: we create the function
cum_incidence <- function(cases, k, poblacion){
  cases <- ifelse(is.na(cases), 0, cases)
  cases <- as.numeric(cases)
  poblacion <- as.numeric(poblacion)
  roll_sum <- rollsum(cases, k, fill = NA, align = "right") # rollsum() comes from {zoo}
  incidence <- (roll_sum / poblacion) * 100000
  return(incidence)
}
#Function applied
covid_incidence <- final_data |>
group_by(provincia, age_group, sex) |>
mutate(
cum_incidence = cum_incidence(cases, k = 7, poblacion = poblacion)) |>
  ungroup()

covid_incidence |>
filter(age_group == "80+",
sex == "M",
fecha == as.Date("2022-03-01")) |>
arrange (desc(cum_incidence)) |>
  slice_head(n = 5)
# A tibble: 5 × 13
provincia_iso fecha age_group cases age_inf age_up sex
<chr> <dttm> <chr> <dbl> <dbl> <dbl> <chr>
1 LU 2022-03-01 00:00:00 80+ 15 80 NA M
2 ZA 2022-03-01 00:00:00 80+ 2 80 NA M
3 OR 2022-03-01 00:00:00 80+ 12 80 NA M
4 AV 2022-03-01 00:00:00 80+ 2 80 NA M
5 O 2022-03-01 00:00:00 80+ 16 80 NA M
# ℹ 6 more variables: month_year <chr>, provincia <chr>, CCAA <chr>,
# cum_cases <dbl>, poblacion <dbl>, cum_incidence <dbl>
ANSWER: We define the function cum_incidence(), which computes the cumulative incidence over the last \(k\) days by dividing the total number of cases in that period by the population of each province, and multiplying by 100,000 to obtain the rate per 100,000 inhabitants. When applying the function we use \(k = 7\) days (one week). We then filter for the profile of interest (women over 80 years of age as of March 1, 2022). The 5 provinces with the highest cumulative incidence in that group are Lugo, Zamora, Ourense, Ávila and Asturias.
The last question does not exist and it is up to you to formulate and solve it. What question can you think of that could be interesting to ask with the current database? Why? What would be its resolution? You have absolute creative freedom to do so.
QUESTION: How has the incidence rate of COVID-19 evolved over time? And what are the differences in COVID-19 incidence across the different CCAA in 2020? We also analyze the differences between sexes and age groups.
COVID-19 OVER TIME
#First we want to see the evolution of the incidence rate of COVID-19
covid_data <- final_data |>
mutate(
tasa_incidencia = (cases / poblacion) * 100000,
month_year_new = as.Date(paste0(year(fecha), "-", month(fecha), "-01"))) |>
filter(!is.na(sex))
plot <- ggplot(covid_data, aes(x = month_year_new, y = tasa_incidencia, color = sex)) +
geom_line() +
labs(title = "Evolution of the incidence rate of COVID-19",
x = "Month",
y = "Incidence rate per 100,000 inhabitants") +
theme_minimal() +
scale_x_date(date_labels = "%b-%Y", date_breaks = "1 month") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
ANSWER: Since March 2020 the incidence rate of COVID-19 generally increases over time, although with several drops: June 2020, and March, April and October 2021. There were also spikes in cases in July-August 2021 and in December 2021-January 2022. We cannot say much about differences by sex, since for many records there is no data for one of the sexes, and sometimes the rates for one sex are extremely low compared to the other, which really indicates missing data (I highly doubt the differences between sexes were truly that large).
COVID-19 BY CCAA
#In this case we'll see the differences using ANOVA.
covid_data_2020 <- covid_data |>
filter(year(month_year_new) == 2020)
anova_result <- aov(tasa_incidencia ~ CCAA, data = covid_data_2020)
summary(anova_result)
                Df Sum Sq Mean Sq F value Pr(>F)
CCAA 18 33375 1854.2 987.9 <2e-16 ***
Residuals 174108 326781 1.9
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
213833 observations deleted due to missingness
ANSWER: There are significant differences in the COVID-19 incidence rate between CCAA in the year 2020.
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = tasa_incidencia ~ CCAA, data = covid_data_2020)
$CCAA
diff lwr upr p adj
AR-AN 0.610636031 0.55437502 0.666897040 0.0000000
AS-AN -0.174332544 -0.26384542 -0.084819666 0.0000000
CB-AN -0.086861013 -0.17511533 0.001393302 0.0597343
CE-AN 1.642582727 1.51026991 1.774895539 0.0000000
CL-AN 0.835123397 0.79271150 0.877535292 0.0000000
CM-AN 0.488666217 0.44114852 0.536183909 0.0000000
CN-AN -0.577161172 -0.64680332 -0.507519025 0.0000000
CT-AN 0.099308565 0.05192197 0.146695161 0.0000000
EX-AN 0.126474915 0.05908478 0.193865047 0.0000000
GA-AN -0.228642980 -0.28168323 -0.175602726 0.0000000
IB-AN -0.163739080 -0.24801419 -0.079463973 0.0000000
MC-AN 0.100592235 0.01554769 0.185636777 0.0046148
MD-AN 0.148335553 0.07575068 0.220920425 0.0000000
ML-AN 2.128180304 2.00722921 2.249131394 0.0000000
NA-AN 0.537110043 0.45684037 0.617379713 0.0000000
PV-AN 0.317531562 0.26423403 0.370829099 0.0000000
RI-AN 0.676623272 0.58917800 0.764068540 0.0000000
VC-AN -0.154938197 -0.20897702 -0.100899372 0.0000000
AS-AR -0.784968574 -0.88181211 -0.688125043 0.0000000
CB-AR -0.697497044 -0.79317849 -0.601815601 0.0000000
CE-AR 1.031946696 0.89456841 1.169324986 0.0000000
CL-AR 0.224487367 0.16823007 0.280744660 0.0000000
CM-AR -0.121969814 -0.18216986 -0.061769769 0.0000000
CN-AR -1.187797203 -1.26663972 -1.108954683 0.0000000
CT-AR -0.511327466 -0.57142409 -0.451230845 0.0000000
EX-AR -0.484161116 -0.56102167 -0.407300566 0.0000000
GA-AR -0.839279011 -0.90392716 -0.774630864 0.0000000
IB-AR -0.774375111 -0.86639906 -0.682351157 0.0000000
MC-AR -0.510043796 -0.60277291 -0.417314682 0.0000000
MD-AR -0.462300478 -0.54375401 -0.380846946 0.0000000
ML-AR 1.517544274 1.39107182 1.644016723 0.0000000
NA-AR -0.073525987 -0.16189642 0.014844450 0.2556200
PV-AR -0.293104469 -0.35796387 -0.228245069 0.0000000
RI-AR 0.065987241 -0.02894847 0.160922952 0.6010281
VC-AR -0.765574227 -0.83104414 -0.700104319 0.0000000
CB-AS 0.087471531 -0.03085933 0.205802391 0.4797426
CE-AS 1.816915270 1.66290213 1.970928414 0.0000000
CL-AS 1.009455941 0.91994540 1.098966483 0.0000000
CM-AS 0.662998761 0.57095910 0.755038423 0.0000000
CN-AS -0.402828628 -0.50801131 -0.297645945 0.0000000
CT-AS 0.273641109 0.18166906 0.365613158 0.0000000
EX-AS 0.300807458 0.19710211 0.404512802 0.0000000
GA-AS -0.054310437 -0.14931905 0.040698179 0.8830184
IB-AS 0.010593463 -0.10480005 0.125986973 1.0000000
MC-AS 0.274924779 0.15896814 0.390881419 0.0000000
MD-AS 0.322668097 0.21551432 0.429821874 0.0000000
ML-AS 2.302512848 2.15814341 2.446882285 0.0000000
NA-AS 0.711442587 0.59894108 0.823944090 0.0000000
PV-AS 0.491864106 0.39671162 0.587016593 0.0000000
RI-AS 0.850955815 0.73322713 0.968684500 0.0000000
VC-AS 0.019394347 -0.07617533 0.114964023 0.9999999
CE-CB 1.729443740 1.57615865 1.882728825 0.0000000
CL-CB 0.921984410 0.83373246 1.010236357 0.0000000
CM-CB 0.575527230 0.48471111 0.666343354 0.0000000
CN-CB -0.490300159 -0.59441387 -0.386186443 0.0000000
CT-CB 0.186169578 0.09542198 0.276917177 0.0000000
EX-CB 0.213335928 0.11071494 0.315956915 0.0000000
GA-CB -0.141781967 -0.23560577 -0.047958165 0.0000178
IB-CB -0.076878067 -0.19129804 0.037541909 0.6636593
MC-CB 0.187453248 0.07246537 0.302441123 0.0000017
MD-CB 0.235196566 0.12909190 0.341301235 0.0000000
ML-CB 2.215041317 2.07144883 2.358633808 0.0000000
NA-CB 0.623971056 0.51246833 0.735473778 0.0000000
PV-CB 0.404392575 0.31042309 0.498362063 0.0000000
RI-CB 0.763484285 0.64670966 0.880258906 0.0000000
VC-CB -0.068077184 -0.16246909 0.026314722 0.5289833
CL-CE -0.807459329 -0.93977056 -0.675148097 0.0000000
CM-CE -1.153916510 -1.28795167 -1.019881348 0.0000000
CN-CE -2.219743899 -2.36312284 -2.076364953 0.0000000
CT-CE -1.543274162 -1.67726290 -1.409285420 0.0000000
EX-CE -1.516107812 -1.65840652 -1.373809101 0.0000000
GA-CE -1.871225707 -2.00731671 -1.735134701 0.0000000
IB-CE -1.806321807 -1.95735090 -1.655292713 0.0000000
MC-CE -1.541990492 -1.69345028 -1.390530703 0.0000000
MD-CE -1.494247174 -1.63907831 -1.349416041 0.0000000
ML-CE 0.485597578 0.31142336 0.659771799 0.0000000
NA-CE -1.105472683 -1.25430384 -0.956641527 0.0000000
PV-CE -1.325051165 -1.46124265 -1.188859679 0.0000000
RI-CE -0.965959455 -1.11878016 -0.813138749 0.0000000
VC-CE -1.797520923 -1.93400421 -1.661037635 0.0000000
CM-CL -0.346457180 -0.39397047 -0.298943887 0.0000000
CN-CL -1.412284569 -1.48192371 -1.342645425 0.0000000
CT-CL -0.735814832 -0.78319702 -0.688432648 0.0000000
EX-CL -0.708648483 -0.77603551 -0.641261453 0.0000000
GA-CL -1.063766378 -1.11680269 -1.010730064 0.0000000
IB-CL -0.998862478 -1.08313510 -0.914589851 0.0000000
MC-CL -0.734531162 -0.81957325 -0.649489078 0.0000000
MD-CL -0.686787844 -0.75936984 -0.614205853 0.0000000
ML-CL 1.293056907 1.17210755 1.414006268 0.0000000
NA-CL -0.298013354 -0.37828042 -0.217746289 0.0000000
PV-CL -0.517591835 -0.57088545 -0.464298221 0.0000000
RI-CL -0.158500126 -0.24594300 -0.071057248 0.0000000
VC-CL -0.990061594 -1.04409655 -0.936026638 0.0000000
CN-CM -1.065827389 -1.13868871 -0.992966063 0.0000000
CT-CM -0.389357652 -0.44135990 -0.337355404 0.0000000
EX-CM -0.362191302 -0.43290321 -0.291479393 0.0000000
GA-CM -0.717309197 -0.77451071 -0.660107686 0.0000000
IB-CM -0.652405297 -0.73935953 -0.565451068 0.0000000
MC-CM -0.388073982 -0.47577414 -0.300373820 0.0000000
MD-CM -0.340330664 -0.41600964 -0.264651685 0.0000000
ML-CM 1.639514087 1.51668123 1.762346942 0.0000000
NA-CM 0.048443826 -0.03463423 0.131521886 0.8639143
PV-CM -0.171134655 -0.22857481 -0.113694498 0.0000000
RI-CM 0.187957055 0.09792695 0.277987156 0.0000000
VC-CM -0.643604414 -0.70173305 -0.585475773 0.0000000
CT-CN 0.676469737 0.60369384 0.749245634 0.0000000
EX-CN 0.703636087 0.61650244 0.790769729 0.0000000
GA-CN 0.348518192 0.27194071 0.425095671 0.0000000
IB-CN 0.413422092 0.31265932 0.514184860 0.0000000
MC-CN 0.677753407 0.57634623 0.779160586 0.0000000
MD-CN 0.725496725 0.63428595 0.816707499 0.0000000
ML-CN 2.705341476 2.57237529 2.838307667 0.0000000
NA-CN 1.114271215 1.01683374 1.211708690 0.0000000
PV-CN 0.894692734 0.81793683 0.971448640 0.0000000
RI-CN 1.253784444 1.15035564 1.357213244 0.0000000
VC-CN 0.422222975 0.34495049 0.499495456 0.0000000
EX-CT 0.027166350 -0.04345753 0.097790230 0.9980731
GA-CT -0.327951545 -0.38504420 -0.270858890 0.0000000
IB-CT -0.263047645 -0.34993030 -0.176164987 0.0000000
MC-CT 0.001283670 -0.08634553 0.088912871 1.0000000
MD-CT 0.049026988 -0.02656975 0.124623722 0.7238106
ML-CT 2.028871739 1.90608954 2.151653939 0.0000000
NA-CT 0.437801478 0.35479833 0.520804626 0.0000000
PV-CT 0.218222997 0.16089124 0.275554751 0.0000000
RI-CT 0.577314707 0.48735373 0.667275685 0.0000000
VC-CT -0.254246762 -0.31226829 -0.196225238 0.0000000
GA-EX -0.355117895 -0.42965320 -0.280582590 0.0000000
IB-EX -0.290213995 -0.38943363 -0.190994356 0.0000000
MC-EX -0.025882680 -0.12575669 0.073991327 0.9999928
MD-EX 0.021860638 -0.06764247 0.111363748 0.9999971
ML-EX 2.001705390 1.86990475 2.133506031 0.0000000
NA-EX 0.410635129 0.31479431 0.506475949 0.0000000
PV-EX 0.191056647 0.11633804 0.265775256 0.0000000
RI-EX 0.550148357 0.44822232 0.652074399 0.0000000
VC-EX -0.281413111 -0.35666228 -0.206163941 0.0000000
IB-GA 0.064903900 -0.02518703 0.154994829 0.5311649
MC-GA 0.329235215 0.23842412 0.420046315 0.0000000
MD-GA 0.376978533 0.29771539 0.456241673 0.0000000
ML-GA 2.356823284 2.23175031 2.481896255 0.0000000
NA-GA 0.765753024 0.67939735 0.852108693 0.0000000
PV-GA 0.546174542 0.48408824 0.608260846 0.0000000
RI-GA 0.905266252 0.81220307 0.998329438 0.0000000
VC-GA 0.073704784 0.01098097 0.136428594 0.0051632
MC-IB 0.264331315 0.15236846 0.376294171 0.0000000
MD-IB 0.312074633 0.20925600 0.414893262 0.0000000
ML-IB 2.291919385 2.15073768 2.433101087 0.0000000
NA-IB 0.700849124 0.59246865 0.809229596 0.0000000
PV-IB 0.481270642 0.39102800 0.571513284 0.0000000
RI-IB 0.840362352 0.72656524 0.954159460 0.0000000
VC-IB 0.008800884 -0.08188154 0.099483305 1.0000000
MD-MC 0.047743318 -0.05570692 0.151193552 0.9838513
ML-MC 2.027588069 1.88594573 2.169230412 0.0000000
NA-MC 0.436517808 0.32753796 0.545497656 0.0000000
PV-MC 0.216939327 0.12597772 0.307900938 0.0000000
RI-MC 0.576031037 0.46166294 0.690399136 0.0000000
VC-MC -0.255530432 -0.34692836 -0.164132500 0.0000000
ML-MD 1.979844751 1.84531393 2.114375576 0.0000000
NA-MD 0.388774490 0.28921247 0.488336514 0.0000000
PV-MD 0.169196009 0.08976047 0.248631543 0.0000000
RI-MD 0.528287719 0.42285503 0.633720406 0.0000000
VC-MD -0.303273750 -0.38320854 -0.223338956 0.0000000
NA-ML -1.591070261 -1.72989821 -1.452242313 0.0000000
PV-ML -1.810648742 -1.93583104 -1.685466449 0.0000000
RI-ML -1.451557032 -1.59465369 -1.308460372 0.0000000
VC-ML -2.283118501 -2.40861820 -2.157618804 0.0000000
PV-NA -0.219578481 -0.30609241 -0.133064550 0.0000000
RI-NA 0.139513229 0.02864976 0.250376694 0.0014704
VC-NA -0.692048240 -0.77902081 -0.605075672 0.0000000
RI-PV 0.359091710 0.26588165 0.452301771 0.0000000
VC-PV -0.472469759 -0.53541128 -0.409528237 0.0000000
VC-RI -0.831561468 -0.92519737 -0.737925564 0.0000000
ANSWER: In general, the differences in COVID-19 incidence rates are significant between almost all pairs of Autonomous Communities. The exceptions (adjusted p > 0.05 in the table above) are: Cantabria-Andalusia, Navarra-Aragón, La Rioja-Aragón, Cantabria-Asturias, Galicia-Asturias, Balearic Islands-Asturias, Valencia-Asturias, Balearic Islands-Cantabria, Valencia-Cantabria, Navarra-Castilla-La Mancha, Extremadura-Catalonia, Murcia-Catalonia, Madrid-Catalonia, Murcia-Extremadura, Madrid-Extremadura, Balearic Islands-Galicia, Valencia-Balearic Islands and Madrid-Murcia.
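Rather than scanning the long table by eye, the non-significant pairs can be extracted programmatically; a sketch, assuming the `anova_result` object from the chunk above:

```r
# Extract the pairs of CCAA whose adjusted p-value exceeds 0.05,
# i.e. those with no significant difference in incidence rate
tukey_df <- as.data.frame(TukeyHSD(anova_result)$CCAA)
tukey_df$pair <- rownames(tukey_df)
subset(tukey_df, `p adj` > 0.05, select = c(pair, diff, `p adj`))
```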
COVID-19 BY AGE AND SEX
# We fit a linear regression. First we prepare the data:
# drop the "not recorded" categories and missing incidence values
covid_data_clean <- covid_data |>
filter(sex != "NC", age_group != "NC") |>
drop_na(tasa_incidencia, sex, age_group)
# Convert to factors so lm() treats them as categorical predictors
covid_data_clean <- covid_data_clean |>
mutate(
sex = factor(sex, levels = c("H", "M")),
age_group = factor(age_group)
)
# Linear regression of incidence rate on age group and sex
modelo_regresion <- lm(tasa_incidencia ~ age_group + sex, data = covid_data_clean)
summary(modelo_regresion)
Call:
lm(formula = tasa_incidencia ~ age_group + sex, data = covid_data_clean)
Residuals:
Min 1Q Median 3Q Max
-3.597 -2.054 -1.110 0.010 106.208
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.12320 0.02122 100.05 <2e-16 ***
age_group10-19 0.76159 0.02822 26.99 <2e-16 ***
age_group20-29 0.64002 0.02778 23.04 <2e-16 ***
age_group30-39 0.79239 0.02760 28.71 <2e-16 ***
age_group40-49 1.27170 0.02743 46.35 <2e-16 ***
age_group50-59 0.43759 0.02755 15.88 <2e-16 ***
age_group60-69 -0.53398 0.02792 -19.13 <2e-16 ***
age_group70-79 -1.02664 0.02862 -35.87 <2e-16 ***
age_group80+ -0.88369 0.02890 -30.58 <2e-16 ***
sexM 0.21650 0.01295 16.72 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.693 on 525408 degrees of freedom
Multiple R-squared: 0.0261, Adjusted R-squared: 0.02608
F-statistic: 1565 on 9 and 525408 DF, p-value: < 2.2e-16
ANSWER: The model is globally significant (F-test p-value < 2.2e-16, below the 0.05 significance level), but its explanatory power is very low: sex and age group explain only about 2.6% of the variance in COVID-19 incidence rates (adjusted R-squared = 0.026). Even so, both variables, and all their categories, are significant in the model.
Regarding sex, being female (M, mujer) is associated with an increase of 0.2165 units in the COVID-19 incidence rate compared with being male (H, hombre), holding age group constant. As noted earlier, this may partly reflect missing values, some of them among males, introducing a bias rather than a genuine difference.
Regarding age, the reference group is 0-9 years. The positive coefficients for the 10-59 age groups indicate that their incidence rates are higher than for the 0-9 group, while the groups over 60 years have lower incidence rates than the youngest group. The latter may be explained by the strong protection of the elderly during the pandemic months.
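The fitted coefficients can be combined into predicted rates for a given profile; a minimal sketch using hypothetical new observations, assuming the `modelo_regresion` object from the chunk above:

```r
# Hypothetical profiles: a woman aged 40-49 and a man aged 80+
nuevos <- data.frame(
  age_group = c("40-49", "80+"),
  sex = c("M", "H")
)
predict(modelo_regresion, newdata = nuevos)
# The first prediction is the sum of the intercept plus the
# age_group40-49 and sexM coefficients: 2.1232 + 1.2717 + 0.2165
```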